Applying BLAST to Text Reuse Detection
نویسندگان
چکیده
We present the results of text reuse detection, based on the corpus of scanned and OCR-recognized Finnish newspapers and journals from 1771 to 1910. Our study draws on BLAST, a software created for comparing and aligning biological sequences. We show different types of text reuse in this corpus, and also present a comparison to the software Passim, developed at the Northeastern University in Boston, for text reuse detection.
منابع مشابه
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
متن کاملPostgraduate Transfer Report.PDF
This thesis builds upon our current understanding of text reuse by proposing a hypothetical framework of text reuse and applying this abstract definition to a specific domain, that of journalistic reuse. The framework aims to explore a suitable measure of reuse and determine suitable discriminators for document derivation. Although text can be reused verbatim (word-for-word), in most cases, tex...
متن کاملCOUNTER: corpus of Urdu news text reuse
Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavail...
متن کاملNon-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse
Text reuse refers to citing, copying or alluding text excerpts from a text resource to a new context. While detecting reuse in contemporary languages is well supported—given extensive research, techniques, and corpora— automatically detecting historical text reuse is much more difficult. Corpora of historical languages are less documented and often encompass various genres, linguistic varieties...
متن کاملComparison between Alternative Load Path Method and a Direct Applying Blast Loading Method in Assessment of the Progressive Collapse
Extensive research has been focused on the progressive collapse analysis of buildings and most of them are based on the alternative path method (APM) with sudden removal of one or several columns. However, in this method the damage of adjacent elements of removed columns under blast conditions was ignored and this issue can lead to an incorrect prediction of progressive collapse. Therefore, in ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2017